[ML] Improve hyperparameter tuning performance #1941
Conversation
… to avoid runtime blowup
Good work on improving the performance by integrating the function smoother into the minimizer! I have a couple of comments, mostly regarding readability.
BOOST_REQUIRE_CLOSE_ABSOLUTE(
    0.0, bias, 4.0 * std::sqrt(noiseVariance / static_cast<double>(trainRows)));
// Good R^2...
BOOST_TEST_REQUIRE(rSquared > 0.98);
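The assertion above bounds the empirical bias by four standard errors of the mean. A minimal sketch of that check, outside the Boost.Test framework (the function name is illustrative, not from the PR):

```cpp
#include <cmath>
#include <vector>

// Returns true if |mean(samples)| <= 4 * sqrt(noiseVariance / n), i.e. the
// empirical bias is within four standard errors of zero.
bool biasWithinBound(const std::vector<double>& samples, double noiseVariance) {
    double mean = 0.0;
    for (double s : samples) {
        mean += s;
    }
    mean /= static_cast<double>(samples.size());
    double bound = 4.0 * std::sqrt(noiseVariance / static_cast<double>(samples.size()));
    return std::fabs(mean) <= bound;
}
```

For zero-mean noise with the stated variance, the sample mean exceeds this bound with probability well under 0.01%, so the test is tight but stable.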
nice!
// Test minimization of some training loss curves from boosted tree hyperparameter
// line searches for:
//   1. Miniboone
//   2. Car-parts
//   3. Boston
good work! 🚀
Thanks for the review @valeriy42! I think I've addressed everything. I also disabled writing out extra stats and model metadata for the time being. This requires changes to the Java code as well and I'll make those together.
Good work and thank you for this explanation of handling mean-variance. I have just a couple of minor comments: take it or leave it. LGTM 🚀
TSizeVecVec testingMasks;
this->setupMasks(numberFolds, trainingMasks, testingMasks);

TDoubleVec K(17);
I would have preferred to call m_K m_SmoothingParameter. Due to our coding standards, it has to be a capital K, although it relates to the small k in the formulas.
// So what are we doing here? When we supply function values we also supply their
// error variance. Typically these might be the mean test loss function across
// folds and their variance for a particular choice of hyperparameters. Sticking
// with this example, the variance allows us to estimate the error w.r.t. the
// true generalisation error due to finite sample size. We can think of the source
// of this variance as being due to two effects: one which shifts the loss values
// in each fold (this might be due to some folds simply having more hard examples)
// and another which permutes the order of loss values. A shift in the loss function
// is not something we wish to capture in the GP: it shouldn't materially affect
// where to choose points to test since any sensible optimisation strategy should
// only care about the difference in loss between points, which is unaffected by a
// shift. More formally, if we assume the shift and permutation errors are independent
// we have for losses l_i, mean loss per fold m_i and mean loss for a given set of
// hyperparameters m that the variance is
//
//   sum_i{ (l_i - m)^2 } = sum_i{ (l_i - m_i + m_i - m)^2 }
//                        = sum_i{ (l_i - m_i)^2 } + sum_i{ (m_i - m)^2 }
//                        = "permutation variance" + "shift variance"        (1)
//
// with the cross-term expected to be small by independence. (Note, the independence
// assumption is reasonable if one assumes that the shift is due to mismatch in hard
// examples since we choose folds independently at random.) We can estimate the
// shift variance by looking at the mean loss over all distinct hyperparameter settings
// and we assume it is supplied as the parameter m_ExplainedErrorVariance. It should
// also be smaller than the variance by construction although for numerical stability
// we prevent the difference becoming too small. As discussed, here we wish to return
// the permutation variance, which we get by rearranging (1).
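The decomposition in (1) is the standard within/between (ANOVA) split and can be checked numerically. A self-contained sketch (not the PR's implementation; struct and member names are illustrative, loosely following the codebase's naming style):

```cpp
#include <cstddef>
#include <vector>

// Holds the three terms of equation (1): s_Total = s_Permutation + s_Shift
// exactly, because each fold mean m_i zeroes the cross-term within its fold.
struct SVarianceDecomposition {
    double s_Total = 0.0;       // sum_i (l_i - m)^2
    double s_Permutation = 0.0; // sum_i (l_i - m_i)^2
    double s_Shift = 0.0;       // sum over folds of n_fold * (m_i - m)^2
};

SVarianceDecomposition decompose(const std::vector<std::vector<double>>& foldLosses) {
    // Overall mean loss m across every fold.
    double m = 0.0;
    std::size_t n = 0;
    for (const auto& fold : foldLosses) {
        for (double l : fold) {
            m += l;
            ++n;
        }
    }
    m /= static_cast<double>(n);

    SVarianceDecomposition result;
    for (const auto& fold : foldLosses) {
        // Per-fold mean loss m_i.
        double mi = 0.0;
        for (double l : fold) {
            mi += l;
        }
        mi /= static_cast<double>(fold.size());
        for (double l : fold) {
            result.s_Total += (l - m) * (l - m);
            result.s_Permutation += (l - mi) * (l - mi);
        }
        result.s_Shift += static_cast<double>(fold.size()) * (mi - m) * (mi - m);
    }
    return result;
}
```

Rearranging as in the comment, the permutation variance the code wants is simply the total variance minus the shift (explained) variance.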
Nice 👍 Thank you for the explanation! 📖
…1960) This makes two changes to deal better with small data sets highlighted by a failure in our QA suite as a result of #1941. In particular:

1. We could miss out rare classes altogether from our validation set for small data sets.
2. We can lose a lot of accuracy by over restricting the number of features we use for small data sets.

Problem 1 is a result of the stratified sampling we perform. If a class is rare and the data set is small we could choose never to sample it in the validation set because it could constitute fewer than one example per fold. In this case, the fraction of each class changes significantly in the remaining unsampled set for each fold we sample, but we compute the desired class counts once upfront based on their overall frequency. We simply need to recompute the desired counts per class based on the frequencies in the remainder in the loop which samples each new fold.

Problem 2 requires that we allow ourselves to use more features than are implied by our default constraint of having n examples per feature for small data sets. Since we automatically remove nuisance features based on their MICe with the target, we typically don't suffer a loss in QoR from allowing ourselves to select extra features. Furthermore, for small data sets runtime is never problematic. For the multi-class classification problem which showed up this problem, accuracy increases from around 0.2 to 0.9 as a result of this change.
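The fix for problem 1 amounts to recomputing desired per-class counts from the *remaining* unsampled examples before each fold is drawn, rather than once from overall frequencies. A hypothetical sketch of that idea (function name, rounding rule, and signature are illustrative, not the actual implementation):

```cpp
#include <cstddef>
#include <vector>

// remaining[c] = unsampled examples of class c. Returns how many examples of
// each class to place in the next fold, given foldsLeft folds still to fill.
// Rounding to nearest means a rare class with fewer than one example per fold
// is still sampled once its share of the shrinking remainder grows, instead of
// being rounded down to zero in every fold.
std::vector<std::size_t> desiredCountsForNextFold(const std::vector<std::size_t>& remaining,
                                                  std::size_t foldsLeft) {
    std::vector<std::size_t> desired(remaining.size());
    for (std::size_t c = 0; c < remaining.size(); ++c) {
        desired[c] = (remaining[c] + foldsLeft / 2) / foldsLeft;
    }
    return desired;
}
```

For example, a class with a single example and two folds left gets a desired count of one, whereas a one-shot upfront computation over many folds would assign it zero everywhere.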
Our default hyperparameter tuning becomes extremely expensive in runtime on large data sets. Currently, our best solution is to suggest using a low train fraction, but this means final train (which is comparatively a couple of orders of magnitude faster) also sees less data. This can be addressed by using a smaller proportion than 1 - 1 / # folds of the data for train during hyperparameter tuning. This provides us with both an escape hatch to avoid pathological run time in the case someone runs against a very large data set and also the means of allowing the user to specify fast training while sacrificing a small amount of accuracy. I tried this out on a few different data sets and the following result is typical:

* This is the default behaviour.

The performance gain is not proportional to the fraction of data used because:
(Note, compared to downsampling while training this has a couple of advantages if runtime is a priority: it guarantees worst case performance and it improves caching since the same rows are always selected rather than a different random sample for each tree.)

This also introduces a hard limit on the maximum number of rows we will use for hyperparameter tuning and enforces this by selecting a lower train fraction when this would be exceeded.
A second improvement relates to initial hyperparameter search. I observed that
The second problem is slightly tricky. For small data sets, losses at different hyperparameter settings are typically noisy and our current strategy of fitting a parabola through the points works well. For large data sets the loss curve is smooth, but often non-parabolic over the range we explore. Fitting a LOWESS regression to the loss curve instead performs pretty well for finding the "true" minimum (also better than interpolation by a GP, which I tried). The issue with different amounts of noise is handled by choosing the amount of smoothing by maximum likelihood. This change generally gives us a small performance bump and, importantly, means we can often skip fine tuning altogether. As such, I now allow max_optimization_rounds_per_hyperparameter to be set to zero.

One last issue was that when we characterised loss variance across the folds we included variance in the mean loss. Particularly for small data sets this could be due to sampling effects: one test fold contains examples which are harder to predict. Over multiple rounds we can estimate this effectively and remove this component of the variance when we fit the GP.